Abstract

To better understand the spatial structure of large panels of economic and financial time series and provide a guideline for constructing semiparametric models, this paper first considers estimating a large spatial covariance matrix of the generalized $m$-dependent and $\beta$-mixing time series (with $J$ variables and $T$ observations) by hard thresholding regularization as long as $\log J\,\mathcal{X}^*(\mathcal{T})/T = o(1)$ (the former scheme with some time dependence measure $\mathcal{X}^*(\mathcal{T})$) or $\log J/T = o(1)$ (the latter scheme with the mixing coefficient $\beta_{mix} = O\{(J^{2+\delta'}\sqrt{\log J\,T})^{-1}\}$, $\delta' > 0$). We quantify the interplay between the estimators' consistency rate and the time dependence level, discuss an intuitive resampling scheme for threshold selection, and also prove a general cross-validation result justifying this. Given a consistently estimated covariance (correlation) matrix, by utilizing its natural links with graphical models and semiparametrics, after "screening" the (explanatory) variables, we implement a novel forward (and backward) label permutation procedure to cluster the "relevant" variables and construct the corresponding semiparametric model, which is further estimated by the groupwise dimension reduction method with sign constraints. We call this the SCE (screen - cluster - estimate) approach for modeling high dimensional data with complex spatial structure. Finally we apply this method to study the spatial structure of large panels of economic and financial time series and find the proper semiparametric structure for estimating the consumer price index (CPI) to illustrate its superiority over the linear models.
1 Introduction



1.1 Large Spatial Covariance Matrix 

Recent breakthroughs in technology have created an urgent need for high-dimensional data analysis tools. Examples include economic and financial time series, genetic data, brain imaging, spectroscopic imaging, climate data and many others. To model high dimensional data, especially large panels of economic and financial time series as our focus here, it is very important to begin with understanding the "spatial" structure (over the space of variables instead of from a geographic point of view; we keep this usage throughout for convenience) instead of simply assuming a specific type of parametric (e.g. linear) model first. Estimation of a large spatial covariance matrix plays a fundamental role here since it can indicate a predictive relationship that can be exploited in practice. It is also very important in numerous other areas of economics and finance, including but not limited to handling heteroscedasticity of high dimensional econometric models, risk management of large portfolios, setting confidence intervals (or interval forecasts) on linear functions of the means of the components, variable grouping via graphs, dimension reduction by principal component analysis (PCA) and classification by linear or quadratic discriminant analysis (LDA and QDA). In recent years, many application areas where these tools are used have dealt with very high-dimensional datasets with relatively small sample size, e.g. the typically low frequency macroeconomic data.

It is well known by now that the empirical covariance matrix for samples of size $T$ from a $J$-variate Gaussian distribution, $N_J(\mu, \Sigma_J)$, is not a good estimator of the population covariance if $J$ is large. If $J/T \to c \in (0, 1)$ and the covariance matrix $\Sigma_J = I$ (the identity), then the empirical distribution of the eigenvalues of the sample covariance matrix $\hat\Sigma_J$ follows the Marcenko-Pastur law (Marcenko and Pastur (1967)) and the eigenvalues are supported on $((1 - \sqrt{c})^2, (1 + \sqrt{c})^2)$. Thus, the larger $J/T$ is, the more spread out the eigenvalues are.
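To make the spreading effect concrete, the following minimal sketch evaluates the support interval quoted above for a few ratios $J/T$. The function name `marchenko_pastur_support` is an illustrative choice, not something defined in the paper, and the sketch covers only the identity-covariance case with $0 < J/T < 1$.

```python
import math

def marchenko_pastur_support(J, T):
    """Limiting eigenvalue support of the sample covariance when Sigma = I.

    Uses the interval ((1 - sqrt(c))^2, (1 + sqrt(c))^2) with c = J / T,
    which is only stated for 0 < c < 1.
    """
    c = J / T
    if not 0 < c < 1:
        raise ValueError("the stated support assumes 0 < J/T < 1")
    lower = (1 - math.sqrt(c)) ** 2
    upper = (1 + math.sqrt(c)) ** 2
    return lower, upper

# All population eigenvalues equal 1, yet the sample eigenvalues spread out
# more and more as J/T grows.
for J, T in [(10, 1000), (100, 1000), (500, 1000)]:
    low, high = marchenko_pastur_support(J, T)
    print(f"J/T = {J / T:.2f}: support ({low:.3f}, {high:.3f}), width {high - low:.3f}")
```

The width of the interval equals $4\sqrt{c}$, so even a moderate ratio $J/T$ produces sample eigenvalues far from the true value 1.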

Therefore, alternative estimators for large covariance matrices have attracted a lot of attention recently. Two broad classes of covariance estimators have emerged. One is to remedy the sample covariance matrix and construct a better estimate by using approaches such as banding, tapering and thresholding. The other is to reduce dimensionality by imposing some structure on the data, such as factor models (Fan et al. (2008)). Among the first class, regularizing the covariance matrix by banding or tapering relies on a natural ordering among variables and assumes that variables far apart in the ordering are only weakly correlated (Wu and Pourahmadi (2003), Bickel and Levina (2008b), Cai and Zhou (2011) among others). However, there are many applications, such as large panels of macroeconomic and financial time series, gene expression arrays and other spatial data, where there is no total ordering on the plane and no defined notion of distance among variables at all. These applications require estimators that are invariant under variable permutations, such as regularizing the covariance matrix by thresholding (El Karoui (2008) and Bickel and Levina (2008a)). In this paper, we consider thresholding of the sample spatial covariance matrix for high dimensional time series, which extends the existing work from the i.i.d. to the dependent scenario. Under the time series setup, a very important question to ask is: how does the time dependence affect the estimate's consistency? This is the first question this paper is going to answer.



For time series, there have been two recent works by Bickel and Gel (2011) and Xiao and Wu (2011) about banding and tapering the large autocovariance matrices for univariate time series. But our goal here is to better understand the spatial structure of high dimensional time series, so we need a consistent estimate of the large spatial covariance matrix, especially under a mixture of serial correlation (temporal dynamics), high dimensional (spatial) dependence structure and moderate sample size (relative to dimensionality).



1.2 Relation with Semiparametric Model Construction 

As mentioned at the very beginning, when the spatial structure of the high dimensional data (time series) is complex, instead of simply assuming a specific type of parametric (e.g. linear) model first, we could adopt the flexible nonparametric approach. Because full nonparametrics suffers from the "curse of dimensionality", various semiparametric models have been considered to maintain flexibility in modeling while attempting to deal with this problem. However, most of the prior semiparametric works were carried out under some (prefixed) specific classes of semiparametric models without discussing which ones might be closer to the actual data structure. More specifically, to model some dependent variable $y$ (or $x_J$) using explanatory variables $x_1, x_2, \dots, x_{J-1}$ (very large $J - 1$), they might suggest the following high dimensional single index (Huang et al. (2010)) or additive models (Meier et al. (2009), Ravikumar et al. (2009)) first and then perform various variable selection techniques to eliminate some $x$'s to avoid overfitting.

• $\mathrm{E}(y) = g(x_1\beta_1 + x_2\beta_2 + x_3\beta_3 + x_4\beta_4 + \dots + x_{J-1}\beta_{J-1})$, where $g$ is an unknown univariate link function, and $\beta_1, \beta_2, \dots, \beta_{J-1}$ are unknown parameters that belong to the parameter space.

• $\mathrm{E}(y) = g_1(x_1) + g_2(x_2) + g_3(x_3) + \dots + g_{J-1}(x_{J-1})$, where $g_1, g_2, g_3, \dots, g_{J-1}$ are the unknown functions to be estimated nonparametrically.

This approach encounters limitations from the following three perspectives. First, when the dimensionality $J - 1 \to \infty$, the prefixed assumption itself becomes more and more questionable. Is the single index model or the additive one closer to the actual data structure? Or maybe some other type of semiparametric structure is more suitable? We do not know. And this becomes more challenging when the sample size $T$ is small (with respect to dimensionality).

Second, this approach of prefixing some specific semiparametric classes first and then selecting variables accordingly is also challenged by another characteristic of high dimensional economic and financial time series: strong spatial dependence (near-collinearity). Under near-collinearity, we expect variable selection to be unstable and very sensitive to minor perturbations of the data. In this sense, we do not expect variable selection to provide results that lead to clearer economic interpretation than principal components or ridge regression. The reason is that, although the Lasso type $L_1$ variable selection techniques (Tibshirani (1996)) can deal with large $J$ and require weaker assumptions on the design matrix $x$ (composed of $x_1, \dots, x_{J-1}$) than the information criteria based $L_0$ and ridge regression type $L_2$ regularization methods, they still require the following restricted eigenvalue (RE) assumption (one of many similar requirements) from Bickel et al. (2009): there exists a positive number $\kappa = \kappa(s)$ such that



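$$\min\left\{ \frac{\|x\Delta\|_2}{\sqrt{T}\,\|\Delta_{\mathcal{R}}\|_2} :\ |\mathcal{R}| \le s,\ \Delta \in \mathbb{R}^{J-1}\setminus\{0\},\ \|\Delta_{\mathcal{R}^c}\|_1 \le 3\|\Delta_{\mathcal{R}}\|_1 \right\} \ge \kappa,$$

where $|\mathcal{R}|$ denotes the cardinality of the set $\mathcal{R}$, $\mathcal{R}^c$ denotes the complement of the set of indices $\mathcal{R}$, and $\Delta_{\mathcal{R}}$ denotes the vector formed by the coordinates of the vector $\Delta$ w.r.t. the index set $\mathcal{R}$. It is essentially a restriction on the eigenvalues of the Gram matrix $\Psi_T = x^\top x/T$ as a function of sparsity $s$. To see this, recall the definitions of restricted eigenvalue and restricted correlation in Bickel et al. (2009):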



$$\psi_{\min}(u) = \min_{z \in \mathbb{R}^{J-1}:\ 1 \le M(z) \le u} \frac{z^\top \Psi_T z}{\|z\|_2^2}, \qquad 1 \le u \le J - 1,$$

$$\psi_{\max}(u) = \max_{z \in \mathbb{R}^{J-1}:\ 1 \le M(z) \le u} \frac{z^\top \Psi_T z}{\|z\|_2^2}, \qquad 1 \le u \le J - 1,$$

$$\psi_{m_1, m_2} = \max\left\{ \frac{c_1^\top x_{I_1}^\top x_{I_2} c_2}{T\,\|c_1\|_2\,\|c_2\|_2} :\ I_1 \cap I_2 = \emptyset,\ |I_i| \le m_i,\ c_i \in \mathbb{R}^{I_i}\setminus\{0\},\ i = 1, 2 \right\},$$

where $M(z)$ is the number of nonzero coordinates of $z$, $|I_i|$ denotes the cardinality of $I_i$ and $x_{I_i}$ is the $T \times |I_i|$ submatrix of $x$ obtained by removing from $x$ the columns that do not correspond to the indices in $I_i$. Lemma 4.1 in Bickel et al. (2009) shows that if the restricted eigenvalue of the Gram matrix $\Psi_T$ satisfies $\psi_{\min}(2s) > 3\psi_{s,2s}$ for some integer $1 \le s \le (J - 1)/2$, Assumption RE holds. Under this condition, various oracle inequalities for the Lasso type estimates can be derived, e.g. Bickel et al. (2009), where the upper bounds typically depend negatively on $\kappa$. From an economic point of view, this in fact requires that the dependence cannot be too strong, which, unfortunately, often fails for large panels of macroeconomic and financial data.
Third, when the proposed high dimensional semiparametric model has a complex structure, finding a proper penalty term and the corresponding estimation method for variable selection in general might be very difficult, since, ideally, the penalty should depend not only on the coefficients, but also on the (shapes of the) unknown nonparametric link functions. Several examples of regularizing high dimensional semiparametric models can be found in Chapters 5 and 8 of Bühlmann and van de Geer (2011), Ravikumar et al. (2009) among others.



To this end, developing a specific model free high dimensional spatial structure first and then constructing the right class of semiparametric models seems important. Specifically speaking, given $x_1, x_2, \dots, x_{J-1}$ and $y$ (or $x_J$), we try to find the index sets $A_1, A_2, \dots, A_S$ (possibly with overlapping elements) such that $y$ could be well approximated by:

$$\mathrm{E}(y) = \sum_{s=1}^{S} g_s\Big(\sum_{l=1}^{|A_s|} x_{A_s l}\,\beta_{sl}\Big) = \sum_{s=1}^{S} g_s(x_{A_s}^\top \beta_s), \qquad (1)$$

where

• $|\cdot|$ denotes the cardinality of the set $\cdot$; $S$ is the number of index sets $A_1, A_2, \dots, A_S$ and also the number of the unknown univariate nonparametric link functions $g_1, \dots, g_S$;

• $x_{A_s} \stackrel{\mathrm{def}}{=} (x_l, l \in A_s)$ is a vector of regressors w.r.t. the index set $A_s$; $\beta_{sl}$, $1 \le s \le S$, $1 \le l \le |A_s|$, are the unknown parameters in the parameter space; $\beta_s = (\beta_{s1}, \dots, \beta_{s|A_s|})^\top$;

• for all $s \ne t$, $j \in A_s$ and $i \in A_t$, $x_j$ and $x_i$ are (conditionally) independent given the other $x$'s.
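As a small illustration of how model (1) nests the single index and additive models, the sketch below evaluates $\sum_s g_s(x_{A_s}^\top \beta_s)$ for one observation. The function `model_mean`, the example index sets, coefficients and link functions are all hypothetical choices made for this illustration; in the paper the $g_s$ and $\beta_s$ are unknown and must be estimated.

```python
import math

def model_mean(x, groups, betas, links):
    """Evaluate sum_s g_s(sum_{l in A_s} x_l * beta_sl) for one observation x.

    groups: list of index sets A_s (as lists of positions in x)
    betas:  list of coefficient vectors beta_s, aligned with groups
    links:  list of univariate link functions g_s
    """
    total = 0.0
    for A_s, beta_s, g_s in zip(groups, betas, links):
        index = sum(x[l] * b for l, b in zip(A_s, beta_s))  # linear index of group s
        total += g_s(index)
    return total

x = [0.5, -1.0, 2.0]

# Single index model: S = 1, one link applied to one linear index.
single_index = model_mean(x, [[0, 1, 2]], [[1.0, 0.5, -0.25]], [math.tanh])

# Additive model: |A_s| = 1 for every s (coefficients set to 1 here).
additive = model_mean(
    x,
    [[0], [1], [2]],
    [[1.0], [1.0], [1.0]],
    [math.sin, math.exp, lambda u: u ** 2],
)

print(single_index, additive)
```

Intermediate structures, such as two multi-variable groups plus one singleton, are obtained simply by changing `groups`, which is exactly the choice the clustering step is meant to inform.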

If $K \stackrel{\mathrm{def}}{=} |A_1 \cup \dots \cup A_S| \ll J - 1$ (although $K$ is still possibly not moderate), we could strike a balance between dimension reduction and flexibility of modeling. Model (1) is very general and includes the single index model if $S = 1$ (Ichimura (1993)), the additive model if $|A_1| = |A_2| = \dots = |A_S| = 1$ (Hastie and Tibshirani (1990)), the partial linear model if $S = 2$, $g_1$ is the identity function and $|A_2| = 1$ (Speckman (1988)) and the partial linear single index model if $S = 2$ and $g_1$ is the identity function (Ahn and Powell (1993), Carroll et al. (1995), Yu and Ruppert (2002)). Model (1) could also be viewed as an extension of the multiple index model (Stoker (1986), Ichimura and Lee (1991), Horowitz (1998), Xia (2008)) and can be further generalized if $\mathrm{E}(y)$ is replaced by $\mu\{\mathrm{E}(y)\}$, where $\mu$ is some known link function. As also considered by Li et al. (2010), if $A_1, A_2, \dots, A_S$ are disjoint (no overlapping elements), then for each group of variables $x_{A_s}$, we could say $g_s$ denotes (the only) one index. Thus according to Li et al. (2010), model (1) is identifiable as every subspace of every group $x_{A_s}$ is identifiable and could be solved efficiently by the groupwise dimension reduction method in Li et al. (2010), where they primarily assume that the grouping information is available. An immediate question is: given $x_1, \dots, x_{J-1}$ ($J - 1 \to \infty$), how can we extract $S$ groups of "relevant" $x$'s with corresponding index sets $A_1, \dots, A_S$ and $|A_1 \cup \dots \cup A_S| \ll J - 1$? This is the second question this paper is going to answer. From now on, we mainly study the case where $A_1, A_2, \dots, A_S$ are disjoint, although in Section 4 we will also present the method generating overlapping index sets.

Before moving on, let us study the differences among various semiparametric models from the graphical point of view. If we use a vertex in the graph to represent a relevant variable, a solid edge in a "block" to represent a linear relationship among the variables inside, a bandy edge (connecting a "block" with the dependent variable $y$) to represent a nonparametric link function, and a crossed vertex to represent an "unrelated" one, then we can visualize different semiparametric models through corresponding graphs. For instance, we can get Figure 1 (left) for the single index model; Figure 1 (right) for the additive model; Figure 2 (left) for the (more general) multiple index model, among many others. As we can see, the underlying difference among various semiparametric models is where to allocate the nonparametric link function and linearity through clustering variables. Consequently, assuming that all the variables have been included (complete graph), if we can find the corresponding type of graph, we can construct the right class of semiparametric models. Sparse concentration matrices are of special interest in graphical models because zero partial correlations help establish independence and conditional independence relations in the context of graphical models and thus imply a graphical structure. For example, if we have a sparse covariance matrix for $y, x_1, \dots, x_6$ as the one in Figure 2 (right), we know that $x_1, \dots, x_6$ are "relevant" to $y$, and due to the "block" structure w.r.t. $x_1, x_2, x_3$ and $x_4, x_5$, we can construct the following class of semiparametric models as a specific case of (1):

$$\mathrm{E}(y) = g_1(x_1\beta_1 + x_2\beta_2 + x_3\beta_3) + g_2(x_4\beta_4 + x_5\beta_5) + g_3(x_6\beta_6). \qquad (2)$$

Now we have found the links among semiparametrics, graphical models and the sparse large spatial covariance matrix. Thus consistently estimating the large sparse covariance matrix first and clustering the (explanatory) variables (or forming a block diagonal structure for the corresponding partition of the covariance matrix) are the key focuses. In this article, we assume that the grouping structure (or the corresponding covariance matrix) and parametric coefficients $\beta_s$ are both time invariant to simplify the study.
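To illustrate how a block structure in a regularized covariance matrix translates into index sets, the sketch below reads the groups off as connected components of the graph whose edges are the nonzero off-diagonal entries. This is a simple stand-in chosen for illustration and is not the forward (and backward) label permutation procedure developed in Section 4; the function name `blocks_from_matrix` and the example matrix are assumptions.

```python
def blocks_from_matrix(sigma, tol=0.0):
    """Group variables into blocks: i and j share a block if they are linked
    by a chain of entries with |sigma[i][j]| > tol (connected components)."""
    J = len(sigma)
    seen = [False] * J
    blocks = []
    for start in range(J):
        if seen[start]:
            continue
        seen[start] = True
        stack, block = [start], []
        while stack:  # depth-first search from `start`
            i = stack.pop()
            block.append(i)
            for j in range(J):
                if not seen[j] and abs(sigma[i][j]) > tol:
                    seen[j] = True
                    stack.append(j)
        blocks.append(sorted(block))
    return blocks

# Thresholded covariance of six explanatory variables (0-based labels),
# mimicking the block pattern behind model (2).
sigma = [
    [1.0, 0.6, 0.4, 0.0, 0.0, 0.0],
    [0.6, 1.0, 0.5, 0.0, 0.0, 0.0],
    [0.4, 0.5, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.7, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
print(blocks_from_matrix(sigma))
```

Because connected components do not depend on how the variables are labeled, the same groups are recovered when the rows and columns are shuffled, which is the situation with spatially unordered panels.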

Another related and potential application of clustering variables comes from group regularization (e.g. group Lasso, Yuan and Lin (2006)) in the modern sparsity analysis. Huang and Zhang (2009) show that, if the underlying structure is strongly group-sparse, group Lasso is more robust to noise due to the stability associated with group structure and thus requires a smaller sample size to meet the sparse eigenvalue condition required in modern sparsity analysis. However, other than in situations, e.g. multi-task learning, where we have clear background knowledge about how to group variables, in general it is hard to tell how to properly group the variables to make use of group regularization. An example can be found in a paper in preparation with Bickel (Song and Bickel (2011)), which discusses three types of estimates for large vector autoregressions w.r.t. different grouping methods. To this end, proper "grouping" of the variables is also significant.

Figure 1: Left: $\mathrm{E}(y) = g(x_1\beta_1 + x_2\beta_2 + x_3\beta_3)$, where $g$ is an unknown univariate link function, and $\beta_1, \beta_2, \beta_3$ are unknown indices which belong to the parameter space. Right: $\mathrm{E}(y) = g_1(x_1) + g_2(x_2) + g_3(x_3)$, where $g_1, g_2$ and $g_3$ are the unknown functions to be estimated nonparametrically.

Figure 2: Left: $\mathrm{E}(y) = g_1(x_1\beta_1 + x_2\beta_2 + x_3\beta_3) + g_2(x_4\beta_4 + x_5\beta_5) + g_3(x_6\beta_6)$, where $g_1, g_2$ and $g_3$ are the unknown functions to be estimated nonparametrically. Right: a sample of block diagonal structure after label permutation to the regularized large spatial covariance matrix.

For semiparametric modeling in econometrics, people usually "group" the variables in a "rule of thumb" way. For example, to model the consumer price index (CPI - all items), they might subjectively group the variables "CPI - apparel & upkeep; transportation; medical care; commodities; durables; services" in the first group, "CPI - all items less food; all items less shelter; all items less medical care" in the second group, "Producer Price Index (PPI) - Finished Goods; Finished Consumer Goods; Intermed Mat. Supplies & Components; Crude Materials" in the third group, "Implicit Price Deflator (of Personal Consumption Expenditures) PCE - all items; durables; nondurables; services" in the fourth group, and all other variables in the last group. Is this way of grouping closest to the actual data structure? Why not put "CPI - Durables; PCE - Durables" in one group and "CPI - Services; PCE - Services" in another group? We are going to provide a data-driven procedure for grouping these variables.

In summary, the novelty of this article lies in the following two aspects. First, under the high dimensional time series situation, we show consistency (and the explicit rate of convergence) of the threshold estimator in the operator norm, uniformly over the class of matrices that satisfy our notion of sparsity as long as $\log J\,\mathcal{X}^*(\mathcal{T})/T = o(1)$ (for the generalized $m$-dependent time series; the meaning of $\mathcal{X}^*(\mathcal{T})$ is presented later) or $\log J/T = o(1)$ (for the $\beta$-mixing process with the mixing coefficient $\beta_{mix} = O\{(J^{2+\delta'}\sqrt{\log J\,T})^{-1}\}$, $\delta' > 0$). Furthermore, we quantify the interplay between the estimators' consistency rate and the time dependence level, which is novel in this context. There are various arguments showing that convergence in the operator norm implies convergence of eigenvalues and eigenvectors (El Karoui (2008) and Bickel and Levina (2008a)), so this norm is particularly appropriate for various applications. We also discuss an intuitive resampling scheme for threshold selection for high dimensional time series, and prove a general cross-validation result that justifies this approach. Second, we propose an SCE (screen - cluster - estimate) approach for modeling high dimensional data with complex spatial structure. Specifically, given a consistently estimated large spatial covariance (correlation) matrix, by utilizing its natural links with graphical models and semiparametrics and using the correlation (or covariance for the standardized observations) between variables as a measure of similarity, after "screening" the (explanatory) variables, we propose a novel forward (and backward) label permutation procedure to cluster the "relevant" (explanatory) variables (or to form a block diagonal structure for the regularized large spatial matrix) and construct the corresponding semiparametric model, which is further estimated by the groupwise dimension reduction method (Li et al. (2010)) with sign constraints.

It is noteworthy that the "screening" in Step 1, "clustering" in Step 2, and the "sign constraints" in Step 3 here are crucial for applying the groupwise dimension reduction method of Li et al. (2010) in the high dimensional situation. First, their method requires the use of a high dimensional kernel function, which faces some limitations when $J \gg T$. Step 1 here helps reduce the dimensionality from $J$ ($\gg T$) to a more manageable level. Second, they primarily assume that the grouping information is available from background knowledge, which is often not the case for the typically (spatially) unordered high dimensional data sets. Although they also proposed an information criterion based grouping method, this approach of "trying" many different combinations of grouping is very computationally intensive and less practical. Step 2 here provides this grouping information from a data-driven approach with feasible computation. Third, without adding the "sign constraints", the signs of the estimated parametric coefficients might violate economic laws (details presented in Section 5). Overall, together with Li et al. (2010)'s very timely and stimulating work, we provide an integrated approach for modeling high dimensional data with complex spatial structure.

The rest of the article is organized as follows. In the next section, we present the main notations of the thresholding estimator. The estimates' properties are presented in Section 3. In Section 4 we state the details of the SCE procedure and in Section 5 apply it to study the spatial structure of large panels of macroeconomic and financial time series and find the proper semiparametric structure for estimating the consumer price index (CPI). Section 6 contains concluding remarks with a brief discussion. All technical proofs are sketched in the appendix.



2 Dynamic Large Spatial Covariance Matrix Estimation 



We start by setting up notation and corresponding concepts for the covariance matrix $\Sigma$, which are mostly from Bickel and Levina (2008b) and Bickel and Levina (2008a). We write $\lambda_{\max}(\Sigma) = \lambda_1(\Sigma) \ge \dots \ge \lambda_J(\Sigma) = \lambda_{\min}(\Sigma)$ for the eigenvalues of a matrix $\Sigma$. Following the notation of Bickel and Levina (2008b) and Bickel and Levina (2008a), we define, for any $1 \le r, s \le \infty$ and a $J \times J$ matrix $\Sigma$,

$$\|\Sigma\|_{(r,s)} \stackrel{\mathrm{def}}{=} \sup\{\|\Sigma x\|_s : \|x\|_r = 1\},$$

where $\|x\|_r^r = \sum_{j=1}^{J} |x_j|^r$. In particular, we write $\|\Sigma\| = \|\Sigma\|_{(2,2)} = \max_{1 \le j \le J} |\lambda_j(\Sigma)|$, which is the operator norm for a symmetric matrix. We also use the Frobenius matrix norm, $\|\Sigma\|_F^2 = \sum_{i,j} \sigma_{ij}^2 = \mathrm{tr}(\Sigma\Sigma^\top)$. Dividing it by a factor $J$ gives $\|\Sigma\|_F^2/J$, which is the average of the squared eigenvalues, while the operator norm $\|\Sigma\|_{(2,2)}$ is the maximum absolute eigenvalue. Bickel and Levina (2008a) define the thresholding operator by

$$T_s(M) \stackrel{\mathrm{def}}{=} [m_{ij}\,1(|m_{ij}| \ge s)],$$

which we refer to as $M$ thresholded at $s$. Notice that $T_s$ preserves symmetry; it is invariant under permutations of variable labels; and if $\|T_s - T_0\| \le \varepsilon$ and $\lambda_{\min}(M) > \varepsilon$, it preserves positive definiteness.
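The thresholding operator is simple enough to state in a few lines of code. The sketch below applies the entrywise rule $m_{ij}\,1(|m_{ij}| \ge s)$ to a small matrix and checks the two structural properties just mentioned, symmetry and invariance under relabeling. The helper names `hard_threshold` and `permute` and the example values are illustrative assumptions.

```python
def hard_threshold(m, s):
    """Entrywise hard thresholding: keep m_ij if |m_ij| >= s, else set it to 0."""
    return [[v if abs(v) >= s else 0.0 for v in row] for row in m]

def permute(m, perm):
    """Relabel variables: row/column k of the result is row/column perm[k] of m."""
    return [[m[i][j] for j in perm] for i in perm]

m = [
    [1.00, 0.30, -0.05],
    [0.30, 1.00, 0.20],
    [-0.05, 0.20, 1.00],
]
thresholded = hard_threshold(m, 0.25)
print(thresholded)

# Thresholding commutes with relabeling, so no ordering of variables is needed.
perm = [2, 0, 1]
print(hard_threshold(permute(m, perm), 0.25) == permute(thresholded, perm))
```

Positive definiteness, by contrast, is not automatic and relies on the eigenvalue condition stated above.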

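We study the properties of the following uniformity class of covariance matrices invariant under permutations

$$\mathcal{U}_\tau(q, c_0(J), M) \stackrel{\mathrm{def}}{=} \Big\{\Sigma : \sigma_{ii} \le M,\ \sum_{j=1}^{J} |\sigma_{ij}|^q \le c_0(J),\ \forall i\Big\}, \qquad 0 \le q < 1.$$

We will mainly write $c_0$ for $c_0(J)$ in the future. Suppose that we observe $T$ $J$-dimensional observations $X_1, \dots, X_T$ with $\mathrm{E}X = 0$ (without loss of generality), and $\mathrm{E}(XX^\top) = \Sigma$, which is independent of $t$. We consider the sample covariance matrix

$$\hat\Sigma \stackrel{\mathrm{def}}{=} T^{-1} \sum_{t=1}^{T} (X_t - \bar X)(X_t - \bar X)^\top = [\hat\sigma_{ij}], \qquad (3)$$

with $\bar X = T^{-1}\sum_{t=1}^{T} X_t$.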



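Let us first recall the fractional cover theory based definition, which was introduced by Janson (2004) and can be viewed as a generalization of $m$-dependency. Given a set $\mathcal{T}$ and random variables $V_t$, $t \in \mathcal{T}$, we say:

• A subset $\mathcal{T}'$ of $\mathcal{T}$ is independent if the corresponding random variables $\{V_t\}_{t \in \mathcal{T}'}$ are independent.

• A family $\{\mathcal{T}_j\}_j$ of subsets of $\mathcal{T}$ is a cover of $\mathcal{T}$ if $\bigcup_j \mathcal{T}_j = \mathcal{T}$.

• A family $\{(\mathcal{T}_j, w_j)\}_j$ of pairs $(\mathcal{T}_j, w_j)$, where $\mathcal{T}_j \subseteq \mathcal{T}$ and $w_j \in [0, 1]$, is a fractional cover of $\mathcal{T}$ if $\sum_j w_j 1_{\mathcal{T}_j} \ge 1_{\mathcal{T}}$, i.e. $\sum_{j: t \in \mathcal{T}_j} w_j \ge 1$ for each $t \in \mathcal{T}$.

• A (fractional) cover is proper if each set $\mathcal{T}_j$ in it is independent.

• $\mathcal{X}(\mathcal{T})$ is the size of the smallest proper cover of $\mathcal{T}$, i.e. the smallest $m$ such that $\mathcal{T}$ is the union of $m$ independent subsets.

• $\mathcal{X}^*(\mathcal{T})$ is the minimum of $\sum_j w_j$ over all proper fractional covers $\{(\mathcal{T}_j, w_j)\}_j$.

Notice that, in the spirit of this notation, $\mathcal{X}(\mathcal{T})$ and $\mathcal{X}^*(\mathcal{T})$ depend not only on $\mathcal{T}$ but also on the family $\{V_t\}_{t \in \mathcal{T}}$. Further note that $\mathcal{X}^*(\mathcal{T}) \ge 1$ (unless $\mathcal{T} = \emptyset$) and that $\mathcal{X}^*(\mathcal{T}) = 1$ if and only if the variables $V_t$, $t \in \mathcal{T}$, are independent, i.e. $\mathcal{X}^*(\mathcal{T})$ is a measure of the dependence structure of $\{V_t\}_{t \in \mathcal{T}}$.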
For example, if $V_t$ only depends on $V_{t-1}, \dots, V_{t-k}$ but is independent of all $\{V_s\}_{s < t-k}$, we will have $k + 1$ independent sets:

$$\mathcal{T}_1 = \{V_1, V_{(k+1)+1}, V_{2(k+1)+1}, \dots\},$$
$$\mathcal{T}_2 = \{V_2, V_{(k+1)+2}, V_{2(k+1)+2}, \dots\},$$
$$\dots$$

s.t. $\bigcup_{j=1}^{k+1} \mathcal{T}_j = \mathcal{T}$. Then $\mathcal{X}^*(\mathcal{T}) = k + 1$ (if $k + 1 < T$).
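The cover in this example can be written down explicitly: time indices are grouped by their residue modulo $k + 1$, so that any two indices in the same set are more than $k$ apart. The sketch below builds these sets for given $T$ and $k$; the function name `residue_cover` is an illustrative assumption.

```python
def residue_cover(T, k):
    """Split time indices 1..T into k + 1 sets T_1, ..., T_{k+1}.

    Set j contains j, j + (k+1), j + 2(k+1), ...; any two of its members are
    at least k + 1 apart, hence independent under k-dependence.
    """
    return [list(range(j, T + 1, k + 1)) for j in range(1, k + 2) if j <= T]

cover = residue_cover(T=10, k=2)
for j, subset in enumerate(cover, start=1):
    print(f"T_{j} = {subset}")
print("number of sets:", len(cover))
```

Every index appears in exactly one set, so the sets form a proper cover of size $k + 1$, in line with the value $\mathcal{X}^*(\mathcal{T}) = k + 1$ stated above.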

Besides the generalized $m$-dependent process, we are also going to consider the $\beta$-mixing process, which is related to the underlying measures of dependence between $\sigma$-fields. More precisely, let $(\Omega, \mathcal{A}, \mathrm{P})$ be a probability space and $\mathcal{U}, \mathcal{V}$ be two sub $\sigma$-algebras of $\mathcal{A}$. The $\beta$-mixing coefficient

$$\beta(\mathcal{U}, \mathcal{V}) = \mathrm{E}\,\mathrm{ess\,sup}\{|\mathrm{P}(V \mid \mathcal{U}) - \mathrm{P}(V)|;\ V \in \mathcal{V}\}$$

is a measure of dependence between $\mathcal{U}$ and $\mathcal{V}$, which was defined by Kolmogorov and first appeared in the paper by Volkonskii and Rozanov (1959). By its definition, the closer to 0 $\beta$ is, the more independent the time series is. For examples of the $\beta$-mixing process, we refer to Doukhan (1994). Throughout this article, we use $\beta_{mix}$ to denote the $\beta$-mixing coefficient for notational convenience.



3 Estimates' Properties 



We have the following two results which parallel those in Bickel and Levina (2008b) and Bickel and Levina (2008a).



3.1 Interplay Between Consistency Rate and Time Dependence Level 



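THEOREM 3.1 (Dependence level affects consistency?) Suppose for all $i, j$, $|X_{ti}X_{tj}| \stackrel{\mathrm{def}}{=} |V_t| \le M_t$ holds with a high probability and $\sum_{t=1}^{T} M_t^2/T$ is bounded by some constant $C'$. Then, uniformly on $\mathcal{U}_\tau(q, c_0(J), M)$, for sufficiently large $M'$ also depending on $C'$, if

$$s_T = M'(C')\sqrt{\frac{\log J\,\mathcal{X}^*(\mathcal{T})}{T}}$$

and $\log J\,\mathcal{X}^*(\mathcal{T})/T = o(1)$, then

$$\|T_{s_T}(\hat\Sigma) - \Sigma\| = O_P\left\{c_0(J)\left[\frac{\log J\,\mathcal{X}^*(\mathcal{T})}{T}\right]^{(1-q)/2}\right\},$$

$$J^{-1}\|T_{s_T}(\hat\Sigma) - \Sigma\|_F^2 = O_P\left\{c_0(J)\left[\frac{\log J\,\mathcal{X}^*(\mathcal{T})}{T}\right]^{1-q/2}\right\}.$$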
f 
Not surprisingly, this theorem states that if we use the hard thresholding method to regularize the large sample covariance matrices, the consistency rate gets slower when the dependence level ($\mathcal{X}^*(\mathcal{T})$) increases, or in other words, the rate is fastest when $\mathcal{X}^*(\mathcal{T}) = 1$, the same as what Bickel and Levina (2008a) show for the i.i.d. case. When $\mathcal{X}^*(\mathcal{T})$ reaches $T$, it cancels the $T$ in the denominator and the bound no longer vanishes. The intuition behind this is clear: if dependence is strong, then the additional information brought by a "new" observation will be effectively less, i.e. the overall information from $T$ observations will be correspondingly less, which results in a slower consistency rate. On the other hand, according to the $\log J\,\mathcal{X}^*(\mathcal{T})/T = o(1)$ requirement, when the dependence level $\mathcal{X}^*(\mathcal{T})$ increases, $J$ must decrease or $T$ must increase to retain the same amount of information.
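The "effective sample size" reading of Theorem 3.1 can be checked numerically: with dependence level $\mathcal{X}^*(\mathcal{T})$, the threshold and the rate behave as if only $T/\mathcal{X}^*(\mathcal{T})$ independent observations were available. The sketch below evaluates the threshold $s_T$ and the operator norm rate; the constants `m_prime` (for $M'$) and `c0` (for $c_0(J)$) are unknown in practice and are set to 1 purely as placeholders, and the function names are illustrative.

```python
import math

def threshold_level(J, T, chi_star=1.0, m_prime=1.0):
    """s_T = M' * sqrt(log J * X*(T) / T); m_prime is a placeholder constant."""
    return m_prime * math.sqrt(math.log(J) * chi_star / T)

def operator_norm_rate(J, T, q, c0=1.0, chi_star=1.0):
    """Rate c0 * (log J * X*(T) / T)^((1 - q) / 2) from Theorem 3.1."""
    return c0 * (math.log(J) * chi_star / T) ** ((1 - q) / 2)

J, T, q = 100, 400, 0.5
for k in [0, 1, 3, 9]:  # k-dependent series, for which X*(T) = k + 1
    chi = k + 1
    print(
        f"k = {k}: s_T = {threshold_level(J, T, chi):.3f}, "
        f"rate = {operator_norm_rate(J, T, q, chi_star=chi):.3f}, "
        f"effective T = {T / chi:.0f}"
    )
```

For instance, a 3-dependent series of length 400 yields the same rate as an independent sample of length 100.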

A very natural question to ask next is: to what extent can dependence (in terms of the $\beta$-mixing coefficient) be allowed while the consistency rate stays the same as in the i.i.d. case? That is, we study the relationship among the high dimensionality $J$, the moderate sample size $T$ and the $\beta$-mixing coefficient $\beta_{mix}$.

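ASSUMPTION 3.1
A1 $\forall t$, $\mathrm{E}\,X_{ti}X_{tj}$ =
A2 $\exists \sigma^2$, $\forall n, m$, $m^{-1}\,\mathrm{E}(X_{ni}X_{nj} + \dots + X_{n+m,i}X_{n+m,j})^2 \le \sigma^2$
A3 $\forall t$, $|X_{ti}X_{tj}| \le M$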

THEOREM 3.2 (Balance "$J, T, \beta$" to achieve "good" consistency rate) Assume the $\beta$-mixing sequence $\{X_{ti}X_{tj}\}_{t=1}^{T}$ satisfies Assumption 3.1 $\forall i, j$ with a high probability. Then, uniformly on $\mathcal{U}_\tau(q, c_0(J), M)$, for sufficiently large $M'$ also depending on $\sigma^2, M$, if $s_T = M'(\sigma^2, M)\sqrt{\log J/T} = o(1)$ and the $\beta$-mixing coefficient $\beta_{mix} = O\{(J^{2+\delta'}\sqrt{\log J\,T})^{-1}\}$, $\delta' > 0$, we have:

$$\|T_{s_T}(\hat\Sigma) - \Sigma\| = O_P\left\{c_0(J)\left(\frac{\log J}{T}\right)^{(1-q)/2}\right\},$$

$$J^{-1}\|T_{s_T}(\hat\Sigma) - \Sigma\|_F^2 = O_P\left\{c_0(J)\left(\frac{\log J}{T}\right)^{1-q/2}\right\}.$$



As we can see, when the dimensionality $J$ increases, since the $\beta$-mixing coefficient is controlled by $O\{(J^{2+\delta'}\sqrt{\log J\,T})^{-1}\}$, $\delta' > 0$, the dependence level must decrease at the rate of $J^{-(2+\delta')}$ (skipping the slowly varying logs). When $J$ is very large, this means "nearly" independent, which again confirms the result from the previous theorem.

3.2 Choice of Threshold via Cross Validation

Choices of threshold play a fundamental role in implementing this estimation procedure. We choose an optimal threshold by a cross-validation procedure as in Bickel and Levina (2008a) and Bickel and Levina (2008b). In particular, we divide the data set $\Omega$ of size $T$ into two consecutive segments, $\Omega_1$ and $\Omega_2$, of size $T_1$ and $T_2$ respectively, where $T_1$ is typically about $T/3$. Then we compare the regularized (via thresholding) "target" quantity $T_s(\hat\Sigma_{1,v})$, estimated from $\Omega_1$, with the "target" quantity $\hat\Sigma_{2,v}$, estimated from $\Omega_2$. Hence $\hat\Sigma_{2,v}$ can be viewed as a proxy for the population "target" quantity $\Sigma$. The subindex $v$ in $T_s(\hat\Sigma_{1,v})$ and $\hat\Sigma_{2,v}$ indicates values from the $v$th split out of a total of $N$ repeats. The optimal threshold is then selected as a minimizer (w.r.t. $s$) of the empirical loss function over $N$ repeats, i.e.

$$\arg\min_{s}\ N^{-1}\sum_{v=1}^{N} \|T_s(\hat\Sigma_{1,v}) - \hat\Sigma_{2,v}\|_F^2. \qquad (4)$$

Similarly, the oracle threshold is selected as a minimizer w.r.t. $s$ of the oracle loss function over $N$ repeats, i.e.

$$\arg\min_{s}\ \mathrm{E}\,\|T_s(\hat\Sigma_{1,v}) - \hat\Sigma_{2,v}\|_F^2. \qquad (5)$$

Figure 3: Illustration of the cross-validation method ("Thresholding" set $\Omega_1$ followed by "Population" set $\Omega_2$).

Since the data are observed in time, the order of $X_t$ is of importance, and hence a random split of $\Omega$ into $\Omega_1$ and $\Omega_2$ is not appropriate in a time series context. Alternatively, we randomly select a consecutive segment of size $T_1 + T_2$ as $\Omega_1 \cup \Omega_2$ from the data set $\Omega$ first, and then take the first third of $\Omega_1 \cup \Omega_2$ as $\Omega_1$ ($T_2 \approx 2T_1$) and the remaining two thirds as $\Omega_2$. Figure 3 provides an illustration of the cross-validation procedure. We repeat this $N$ times as before. Our goal now is to show that the rates of convergence for the empirical loss function and the oracle loss function are of the same order and hence, asymptotically, the empirical threshold $\hat s$ performs as well as the oracle threshold $s^o$ selection.
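The resampling scheme just described can be sketched in plain Python as follows. For each repeat a consecutive segment is drawn at random, its first third is used for the thresholded estimate and the remaining two thirds for the proxy, and the threshold minimizing the average squared Frobenius loss (4) over a candidate grid is returned. The function names, the grid, the segment length and the synthetic AR(1)-type data are all assumptions made for illustration; they are not taken from the paper's empirical study.

```python
import random

def sample_cov(rows):
    """Sample covariance (divisor T) of a list of T observations of length J."""
    T, J = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / T for j in range(J)]
    return [[sum((r[i] - mean[i]) * (r[j] - mean[j]) for r in rows) / T
             for j in range(J)] for i in range(J)]

def hard_threshold(m, s):
    """Keep entries with |m_ij| >= s and set the others to zero."""
    return [[v if abs(v) >= s else 0.0 for v in row] for row in m]

def frobenius_sq(a, b):
    """Squared Frobenius distance between two matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

def split_segment(T, seg_len, rng):
    """Draw a consecutive segment and split it into first third / last two thirds."""
    start = rng.randrange(T - seg_len + 1)
    t1 = seg_len // 3
    return (start, start + t1), (start + t1, start + seg_len)

def select_threshold(data, grid, n_repeats, seg_len, seed=0):
    """Return the grid value minimizing the empirical loss (4), plus all losses."""
    rng = random.Random(seed)
    loss = [0.0] * len(grid)
    for _ in range(n_repeats):
        (a, b), (c, d) = split_segment(len(data), seg_len, rng)
        cov1, cov2 = sample_cov(data[a:b]), sample_cov(data[c:d])
        for k, s in enumerate(grid):
            loss[k] += frobenius_sq(hard_threshold(cov1, s), cov2) / n_repeats
    best = min(range(len(grid)), key=loss.__getitem__)
    return grid[best], loss

# Synthetic serially dependent data: four independent AR(1) series.
rng = random.Random(1)
data, prev = [], [0.0] * 4
for _ in range(120):
    prev = [0.5 * p + rng.gauss(0.0, 1.0) for p in prev]
    data.append(prev)

s_hat, losses = select_threshold(data, [0.0, 0.1, 0.2, 0.4], n_repeats=5, seg_len=90)
print("selected threshold:", s_hat)
```

Keeping each split consecutive preserves the time ordering inside both segments, which is the point of departing from a purely random split.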

Our theoretical justification is based on adapting the results on the optimal threshold selection in Bickel and Levina (2008a) and the optimal band selection in Bickel and Gel (2011) to the case of an optimal choice of a threshold for high dimensional $\beta$-mixing time series. Let $W_1, \dots, W_n, \dots, W_{n+B}$ be column vectors with common mean $\mathrm{E}W$. Let $\|x\|_v = \max_{p=1,\dots,P} |v_p^\top x|$ for vectors $x$ and $v_p$ of the same dimension with $\|v_p\| = 1$, and $\bar W_B = B^{-1}\sum_{p=1}^{B} W_{n+p}$. Then the empirical and oracle estimates based on $W_k$ are defined as

$$\hat\mu^e \stackrel{\mathrm{def}}{=} \arg\min_{p=1,\dots,P} |\bar W_B - \hat\mu_p|, \qquad (6)$$

$$\hat\mu^o \stackrel{\mathrm{def}}{=} \arg\min_{p=1,\dots,P} |\mathrm{E}W - \hat\mu_p|, \qquad (7)$$

respectively, where $\hat\mu_p$ is estimated using $W_1, \dots, W_n$.



We use Theorem 3 in Bickel and Levina (2008a) as Lemma 3.1 here, which states a result on the
asymptotic relation between the empirical and oracle estimates $\hat\mu^{e}$ and $\hat\mu^{o}$.



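LEMMA 3.1 (Theorem 3 in Bickel and Levina (2008a)) If the following assumptions (A4, A5,
A6) are satisfied

A4 $|\hat\mu^{o} - \mathrm{E} W|^2 = \Omega_P(r_n)$;

A5 $\mathrm{E} \max_{p=1,\ldots,P} |v_p^\top(\bar W_B - \mathrm{E} W)|^2 \leq C\rho(P)$ for $\|v_p\| = 1$;

A6 $\rho(P_n) = o(r_n)$,

then we have

$$|\hat\mu^{e} - \mathrm{E} W|^2 = |\hat\mu^{o} - \mathrm{E} W|^2\{1 + o(1)\} = \Omega_P(r_n).$$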

Without loss of generality, assume that the number of repeats $N = 1$. Notice that the empirical
estimates $T_s(\hat\Sigma_{1,v})$ and $\hat\Sigma_{2,v}$ play the role of $\hat\mu_p$ and $W$ here respectively. Hence, if we can verify the
conditions of Lemma 3.1, we can apply it to justify the choice of a threshold by cross-validation and
show that such a regularized covariance matrix of high dimensional time series $\{X_t\}$, with an empirically
selected threshold, asymptotically coincides with the regularized estimate selected by the oracle. To this
end, we also need the auxiliary Lemma 3.2.



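LEMMA 3.2 Assume that $v_t$ is white noise satisfying $\mathrm{E} v_t = 0$, $\mathrm{E} v_t^2 = \sigma^2$ and $\mathrm{E}|v_t|^{\delta} \leq C < \infty$ for
$\delta > 2$. Let $\|V_p\|_F = 1$. For the $\beta$-mixing process satisfying the conditions of Theorem 3.2, we have

$$P\big[J^{-1}|\mathrm{tr}(V_p\hat\Sigma_B - V_p\Sigma)| \geq s\big] \leq K_1\exp(-K_2 s^2 B),$$

$$J^{-1}\,\mathrm{E}\max_{p=1,\ldots,P}\big(|\mathrm{tr}\{V_p\hat\Sigma_B - \mathrm{E}(V_p\Sigma)\}|\big) \leq C(q, c_0, M)\sqrt{\log P/B}$$

with some constants $K_1$ and $K_2$.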

THEOREM 3.3 (Consistency of Cross Validation) Let $\hat s$ and $s_0$ be the thresholds selected by
minimizing the empirical and oracle loss functions (4) and (5) respectively. Then under the conditions



of Theorem 3.2 and Op = fip, if Bt = Te{T,J), hgP = o{T''^^co{J)J'\\ogjy~'i/^e{T,J)}, based 



on Lemma 3.1 and 3.2, then

$$\|T_{\hat s}(\hat\Sigma) - \Sigma\|_F = \|T_{s_0}(\hat\Sigma) - \Sigma\|_F\{1 + o_P(1)\}.$$
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4 The Screen - Cluster - Estimate (SCE) Procedure 

To circumvent the problems in semiparametric modeling for high dimensional data with complex spa- 
tial structure, in the following three subsections, we state the three-step SCE procedure for construct- 
ing and estimating semiparametric models from a large number of unordered explanatory variables 
with a moderate sample size. 



4.1 Screen 

1 Estimate the large $J \times J$ covariance (Spearman's correlation) matrix of the dependent variable $y$ ($x_J$)
and all explanatory variables $x_1, \ldots, x_{J-1}$ using hard thresholding as $T_{\hat s}(\hat\Sigma)$, and only
keep and consider the (say $K$) $x$'s with nonzero correlation entries with $y$ for the following steps.
Without loss of generality, we rename the $K$ $x$'s as $x_1, x_2, \ldots, x_K$.

Since all observations are standardized first, the previously considered covariance matrix is actually
the (Pearson's) correlation coefficient matrix. However, at the "screening" step, we estimate and
threshold the large Spearman's rank correlation matrix, where Spearman's rank correlation between
$x_i$ and $x_j$ is defined as:

$$\rho_{x_i,x_j} = \frac{\mathrm{Cov}\{F_i(x_i), F_j(x_j)\}}{\sqrt{\mathrm{Var}\{F_i(x_i)\}\,\mathrm{Var}\{F_j(x_j)\}}}, \tag{8}$$

and $F_i$ and $F_j$ are the cumulative distribution functions of $x_i$ and $x_j$ respectively. It can be seen
that the population version of Spearman's rank correlation is just the classic Pearson's correlation
between $F_i(x_i)$ and $F_j(x_j)$. We consider Spearman's rank correlation instead of Pearson's
correlation coefficient because the latter is sensitive only to a linear relationship between two
variables, while the former is more robust, that is, more sensitive to
nonlinear relationships. It could be viewed as a nonparametric measure of correlation and is especially
suitable for the non- and semiparametric situations we consider here. It assesses how well an arbitrary
monotonic function could describe the relationship between two variables. Specifically, it
measures the extent to which, as one variable increases, the other variable tends to increase, without
requiring that increase to be represented by a linear relationship. If, as one variable increases, the
other decreases, the rank correlation coefficient will be negative. Similar to the consistency results
for the large thresholded spatial covariance (correlation) matrix studied here, Xu and Bickel
(2010) established those for the large Spearman's rank correlation matrix (for the i.i.d. case).
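As a small numerical illustration of the point that rank correlation picks up monotone but nonlinear dependence, the sketch below computes the sample Spearman correlation as the Pearson correlation of the ranks. The helper names and the simulated data are assumptions made for this example only, and ties are not handled.

```python
import numpy as np

def ranks(x):
    """Ranks 1..n of a sample (this sketch assumes there are no ties)."""
    order = np.argsort(x)
    r = np.empty(len(x), dtype=float)
    r[order] = np.arange(1, len(x) + 1)
    return r

def pearson(x, y):
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Sample Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

rng = np.random.default_rng(0)
x = rng.standard_normal(500)
y = np.exp(3.0 * x)        # increasing in x, but far from linear

print(pearson(x, y))       # clearly below 1
print(spearman(x, y))      # equal to 1 up to floating point error
```

Since `y` is a strictly increasing function of `x`, the two rank vectors coincide, so the rank correlation is one even though the linear correlation is not.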

At step 1, via hard thresholding, we single out the important predictors by using their Spearman's
rank correlations with the response variable $y$ and eliminate all explanatory variables that are
"irrelevant" to $y$. In light of equation (1), we actually get an estimate for $A_1 \cup \ldots \cup A_S$. Thus we could
reduce the feature space significantly from $J$ to a lower dimensional and more manageable space.
Correlation learning is a specific case of independent learning, which ranks the features according to
the marginal utility of each feature. Computational expediency and stability are prominent
features of independent learning. This kind of idea is frequently used in applications (Guyon and
Elisseeff (2003)) and its theoretical properties have recently been carefully studied by Fan and Lv
(2008), using Pearson's correlation for variable screening of linear models; Huang et al. (2008), who
proposed the use of marginal bridge estimators to select variables for sparse high dimensional
regression models; Fan et al. (2011), using the marginal strength of the marginal nonparametric regression
for variable screening of additive models; and Hall and Miller (2009), using the generalized correlation for
variable selection of linear models.

It is also worth noting that the threshold is a global measure (implicitly) depending on all $J$
variables. If we remove some $x$'s from the original set of explanatory variables, the threshold value will
change correspondingly. Thus the sets of "relevant" and "irrelevant" regressors will also change.
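A minimal sketch of the screening step is given below. It assumes that the response is stored as the last column (playing the role of $x_J$), that there are no ties, and that the threshold `s` has already been chosen; the function names `spearman_matrix` and `screen` are hypothetical and not taken from the paper.

```python
import numpy as np

def spearman_matrix(Z):
    """Sample Spearman correlation matrix of the columns of Z (assumes no ties)."""
    ranks = Z.argsort(axis=0).argsort(axis=0).astype(float)
    return np.corrcoef(ranks, rowvar=False)

def screen(X, y, s):
    """Keep the columns of X whose thresholded rank correlation with y is nonzero."""
    Z = np.column_stack([X, y])                  # y is appended as the last column
    C = spearman_matrix(Z)
    C_thr = np.where(np.abs(C) >= s, C, 0.0)     # hard thresholding at level s
    keep = np.flatnonzero(C_thr[:-1, -1])        # nonzero entries in the y column
    return keep, C_thr
```

The returned matrix is thresholded over all variables jointly, which mirrors the remark that the threshold is a global quantity: dropping columns of `X` before calling `screen` would in general call for a different `s`.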



4.2 Cluster 

Motivated by the fact that in a block diagonal matrix, the nonzero entries along the diagonal are
denser than those in the off-diagonal region and the assumption w.r.t. equation (1): "$\forall j \neq l$, $x_j \in A_j$,
$x_l \in A_l$, $x_j$ and $x_l$ are (conditionally) independent given other $x$'s", we define the following "averaged
non-zero" score $S^A$ for an index set $A$: $S^A \stackrel{\mathrm{def}}{=} \sum_{i,j \in A} \mathbf{1}(\hat\sigma_{ij} \neq 0)/|A|^2$. Here we do not distinguish
between the positive and negative values of $\hat\sigma_{ij}$ since they could also be reflected by the corresponding
linear coefficients as in equation (1).

2 Perform the label permutation procedure for $x_1, \ldots, x_K$ to form clusters of (explanatory)
variables (or $A_1, \ldots, A_S$) by utilizing the "averaged non-zero" score $S^A$.

2.1 Rank (in decreasing order) and relabel all $x_1, \ldots, x_k, \ldots, x_K$ according to $\sum_{1 \leq j \leq K} \mathbf{1}(\hat\sigma_{kj} \neq 0)$
to obtain the "new" $x_1, \ldots, x_K$. Always assume $x_1$ is in the first block (index set) $A_1$.

2.2 Forward: Include $x_k$ ($2 \leq k \leq K$) in the first index set ($x_k \in A_1$) if the score $S^{A_1 \cup \{x_k\}}$ reaches
the cutoff, and continue searching until the $K$th variable $x_K$. Without loss of generality (otherwise just
relabel them), we assume $x_1, x_2, \ldots, x_{k-1} \in A_1$.

2.3a (For the case of no overlapping indices among $A_1, \ldots, A_S$)
Given $A_1$ formed in the last step, perform Steps 2.1 and 2.2 again for the variables not in
the set $A_1$, i.e. $\{1, 2, \ldots, K\} \setminus A_1$, and construct $A_2$.

2.3b Backward (replaces Step 2.3a, for the case allowing overlapping indices among $A_1, \ldots, A_S$)
Given $A_1$, perform Step 2.1 again for the variables not in the set $A_1$, i.e. $\{1, 2, \ldots, K\} \setminus A_1$,
and start to construct $A_2$, for example, $x_k \in A_2$. Let $x_i \in A_1 \cap A_2$ only if the score $S^{\{x_i\} \cup A_2}$ reaches
the cutoff, and continue searching until $x_{k-1}$. Notice that it is impossible for all $x_1, \ldots, x_{k-1} \in A_2$
because of the way we construct $A_1$ in the forward step. Continue to construct $A_2$ as in
the forward step by selecting among the remaining variables.

2.4 Continue this procedure until all variables $x_1, \ldots, x_K$ have been included into some index
set(s) of $A_1, \ldots, A_S$, where $S$ is the number of selected index sets and $A_1 \cup \cdots \cup A_S =
\{1, 2, \ldots, K\}$. Given these, construct the corresponding semiparametric models by equation
(1).
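The forward variant without overlapping index sets (Steps 2.1, 2.2 and 2.3a) might be sketched as follows. This is a hedged illustration rather than the authors' code: the cutoff `tau` for the "averaged non-zero" score is an assumed tuning parameter, since its value is not available in this text, and on each pass the variables are re-ranked by their nonzero counts among the variables not yet assigned.

```python
import numpy as np

def score(nz, A):
    """Averaged non-zero score: share of nonzero entries in the A x A sub-block."""
    idx = np.asarray(A)
    return nz[np.ix_(idx, idx)].sum() / len(idx) ** 2

def forward_clusters(C_thr, tau=0.8):
    """Greedy forward grouping on a thresholded K x K correlation matrix.

    tau is a hypothetical cutoff for the score; it is an assumption of this sketch.
    """
    nz = (C_thr != 0)
    remaining = list(range(C_thr.shape[0]))
    clusters = []
    while remaining:
        # Step 2.1: rank the remaining variables by their number of nonzero entries.
        counts = nz[np.ix_(remaining, remaining)].sum(axis=1)
        order = [remaining[i] for i in np.argsort(-counts, kind="stable")]
        A = [order[0]]                       # the top-ranked variable starts the block
        # Step 2.2: add a variable if the enlarged block stays dense enough.
        for k in order[1:]:
            if score(nz, A + [k]) >= tau:
                A.append(k)
        clusters.append(sorted(A))
        # Step 2.3a: repeat on the variables that are not yet assigned.
        remaining = [k for k in remaining if k not in A]
    return clusters
```

On a matrix that is block diagonal up to a relabeling, the returned groups coincide with the blocks, which is the situation the score is designed for.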

At Step 2, if we can permute the variables' labels to have a block diagonal structure for the
partition of the consistently estimated covariance matrix, such as the one in Figure 2 (right), we can
construct the corresponding class of semiparametric models as specific cases of equation (1). By this
step, we are grouping the (explanatory) variables into highly correlated groups, which are, however,
weakly correlated with each other. This "independence" property (between the "new" predictors
$\beta_{A_s}^\top x_{A_s}$, $1 \leq s \leq S$) is actually also required for the groupwise dimension reduction method of Li et al.
(2010) for estimation. Besides using the Spearman's rank correlation instead of the Pearson's
correlation for "screening", a second difference between this work and Fan and Lv (2008) is that we
consider the covariance (correlation) matrix for all $x_1, \ldots, x_{J-1}, y$ variables instead of just between
$y$ and $x_1, \ldots, x_{J-1}$, because we need to further group the relevant explanatory variables for
semiparametric model construction at the second step.

A very important feature of the proposed label permutation procedure is that it is based on the
thresholding regularized covariance matrix instead of the sample one. A related work which tries to
discover the ordering of the variables through the metric multi-dimensional scaling method can be
found in Wagaman and Levina (2009) for the i.i.d. Gaussian case. Their ultimate goal is to improve
covariance matrix estimation rather than to order the variables per se. Thus by utilizing the discovered
"order" based on the sample covariance matrix, they estimate the large covariance matrix through
banding regularization to enjoy the benefits brought by ordering. But in the case of large panels
of economic and financial variables as we consider here, our ultimate goal is to cluster the variables
to construct the proper semiparametric models instead of "ordering". For example, in the multiple
index model, the order of the first index (or first cluster of variables) and the second index (or second
cluster of variables) and the order of variables inside each "cluster" are both unimportant.
This case is also related to hierarchical clustering, the k-means algorithm and the correlation clustering
problem in computer science (Demaine and Immorlica (2003), Bansal et al. (2004)), which aims to
partition a weighted graph with positive and negative edge weights so that negative edges are broken
up and positive edges are kept together. However, the correlation clustering algorithm is also based on
the sample correlation(s), and has also been shown to be NP-hard. Thus, as a key difference from other
works in the literature, instead of using the sample covariance (or correlation) matrix for ordering
and clustering as Wagaman and Levina (2009), Demaine and Immorlica (2003) and Bansal et al.
(2004) did, we implement thresholding regularization for the sample covariance matrix and screening
first and then find the corresponding groups through the stepwise label permutation procedure. It is
simpler to implement than theirs, since the thresholding regularized covariance matrix has only a
limited number of nonzero entries. By doing so, w.r.t. the regression setup, we also simultaneously
extract the "relevant" explanatory variables for $y$ (Step 1). Thus, we actually combine dimension
reduction and variable clustering, which is especially suitable for modeling high dimensional data via
semiparametric methods.

This procedure is computationally simple for a typical $J \leq 150$ macroeconomic and financial data
set since the thresholding regularization procedure removes $J - 1 - K$ "irrelevant" variables first, and
then ranks the remaining $K$ ones before entering this label permutation procedure. Thus we avoid the
NP-hard correlation clustering problem based on the sample covariance matrix.

4.3 Estimate 

3 Groupwise dimension reduction with sign constraints. 



For Step 3, we implement the groupwise dimension reduction estimation procedure modified from Li
et al. (2010). If we implement their method directly, as we can see from Table 1 (details of data
presented later), the Spearman's rank correlations between $x_1, \ldots, x_K$ and $y$ are all positive; however,
some of their corresponding parametric coefficients are estimated to be negative (details presented later
in Table 2). This would mean that the consumer price index negatively depends on them, which is unlikely
to be true from an economic point of view. Since we have disjoint groups of variables here and given
the meaning of the Spearman's rank correlation, ideally, the sign of the corresponding parametric
coefficient estimate w.r.t. $x_k$, $1 \leq k \leq K$, should be the same as the sign of the corresponding
Spearman's rank correlation. This motivates us to add the sign constraint, as a refinement, to the
groupwise dimension reduction method developed in Li et al. (2010) to secure the sign consistency.

Let us first consider a simple linear regression model $\mathrm{E} y = x_1\beta_1 + \ldots + x_K\beta_K \stackrel{\mathrm{def}}{=} x^\top\beta$ with
the constraint $\beta_1, \ldots, \beta_K \geq 0$. The linear coefficient $\beta$ could be estimated as the minimizer of
$L_\lambda(\beta) \stackrel{\mathrm{def}}{=} \|y - x^\top\beta\|_2^2/2 - \sum_{k=1}^{K} \lambda_k\beta_k$ with the corresponding nonnegative Lagrange multipliers $\lambda_k$'s, $1 \leq k \leq K$.
If we denote the vector $(\lambda_1, \ldots, \lambda_K)^\top$ by $\Lambda$, $\hat\beta = (x^\top x)^{-1}(x^\top y + \Lambda) = \hat\beta_{OLS} + (x^\top x)^{-1}\Lambda$. Intuitively, in
case some entry of $\hat\beta_{OLS}$, say the $k$th, is negative, which contradicts the initial requirement $\beta_k \geq 0$,
$(x^\top x)^{-1}\lambda_k$ plays the role of adding a positive increment to it, s.t. $\hat\beta_k \geq 0$.
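The relation between the constrained estimate, the OLS estimate and the multipliers can be checked numerically. The sketch below is an illustration under assumptions: the data are simulated, `X` denotes an $n \times K$ design matrix, and the nonnegative least squares problem is solved by a brute-force search over supports (the hypothetical helper `nonneg_least_squares`), which is only practical for small $K$ and is not the estimation procedure of the paper.

```python
import itertools
import numpy as np

def nonneg_least_squares(X, y):
    """Minimize ||y - X b||^2 / 2 subject to b >= 0 by enumerating supports."""
    K = X.shape[1]
    best_b, best_obj = np.zeros(K), 0.5 * float(y @ y)   # start from b = 0
    for r in range(1, K + 1):
        for support in itertools.combinations(range(K), r):
            cols = list(support)
            b_s, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
            if np.all(b_s >= 0):                 # feasible candidate on this support
                b = np.zeros(K)
                b[cols] = b_s
                obj = 0.5 * float(np.sum((y - X @ b) ** 2))
                if obj < best_obj:
                    best_b, best_obj = b, obj
    return best_b

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = X @ np.array([1.0, 0.5, -0.4]) + 0.1 * rng.standard_normal(200)

G = X.T @ X
b_ols = np.linalg.solve(G, X.T @ y)      # unconstrained estimate, third entry negative
b_con = nonneg_least_squares(X, y)       # sign-constrained estimate
lam = G @ b_con - X.T @ y                # multipliers implied by stationarity

print(b_ols, b_con, lam)
```

Here `lam` is zero (up to rounding) for the coefficients that are positive anyway and strictly positive for the coefficient that OLS would push below zero, and `b_con` equals `b_ols` plus $(X^\top X)^{-1}$ times `lam`.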



Similarly, in our setup, if we use $\hat\sigma_{kJ}$ to denote the Spearman's rank correlation estimate between
$x_k$ and $y$ ($x_J$) extracted from $T_{\hat s}(\hat\Sigma)$ and add the sign constraint $\mathrm{sign}(\hat\sigma_{kJ})\beta_k \geq 0$ to the estimation
procedure of Li et al. (2010) ($\beta_s$ here corresponds to their $\beta_g$), a simple calculation shows that we just
need to replace their estimation equation (15) for $\beta = (\beta_1, \ldots, \beta_K)^\top$ by ($\beta$ here corresponds to their $\zeta$):

T T _^ n n 

i=i j=i i=i j=i 

T T _^ 



(9) 



i=i j=i 

where $\Lambda'$ is a $K \times K$ diagonal matrix $\mathrm{diag}[\lambda_1\mathrm{sign}(\hat\sigma_{1J}), \ldots, \lambda_K\mathrm{sign}(\hat\sigma_{KJ})]$; $\{\lambda_k, 1 \leq k \leq K\}$ are the
hyperparameters; $\zeta$ and other variables are the same as in equation (15) of Li et al. (2010). In general,
selection of $\lambda$'s requires minimizing some loss function. Motivated by the discussion above for the
simple linear regression case, when $\mathrm{sign}(\zeta_k)$ is the same as $\mathrm{sign}(\hat\sigma_{kJ})$, we simply choose $\lambda_k = 0$,
otherwise choose $\lambda_k$ to be the minimum (positive) value s.t. $\beta_k = 0$. By our experience, this works
well and the convergence of the iterative estimation procedure is achieved within 19 iteration steps 
(10~^ as the tolerance) for modeling CPI, which is to be presented in Section [s] Then by the property 
of the convex minimization problem, if a local minimum exists, it is also a global minimum. 



Overall, similar to Fan and Lv (2008)'s "screen first; fit later" approach for modeling high
dimensional data, ours could be considered as the "screen first; group second; fit third" approach.
Alternatively, Bickel et al. (2009) and Meinshausen and Bühlmann (2006) consider the "fit first;
screen later" approach. In general, a great deal of work is needed to compare "screen first; fit later"
types of methods with "fit first; screen later" types of methods in terms of consistency and oracle
properties. But when the spatial structure is complex (so that we need to deviate from linearity), in terms
of semiparametric modeling, as we have discussed in Section 1, the latter might face several main
limitations, while ours, together with the estimation method modified from Li et al. (2010), as a
special case of the former, could circumvent these issues and would be faster when dealing with
higher dimensionality.
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Figure 4: Sample and regularized covariance matrices (after multiplying each entry's value by 100). 

5 Application 



We use the dataset of Stock and Watson (2005). This dataset contains 131 monthly macro indicators
covering a broad range of categories including income, industrial production, capacity, employment
and unemployment, consumer prices, producer prices, wages, housing starts, inventories and orders,
stock prices, interest rates for different maturities, exchange rates, money aggregates and so on. The
time span is from January 1959 to December 2003. We apply logarithms to most of the series except
those already expressed in rates. The series are transformed to obtain stationarity by taking (the
1st or 2nd order) differences of the raw data series (or the logarithm of the raw series). Then all
observations are standardized.
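The preprocessing just described (optional logarithm, first or second order differencing, standardization) can be written compactly. The sketch below is an assumption-laden illustration: the function name `transform_series` and its arguments are hypothetical, and which series receive logarithms or second differences is a per-series choice that is not reproduced here.

```python
import numpy as np

def transform_series(x, take_log=True, diff_order=1):
    """Log (optional), difference (order 1 or 2) and standardize one series."""
    z = np.log(x) if take_log else np.asarray(x, dtype=float)
    z = np.diff(z, n=diff_order)              # differencing shortens the series
    return (z - z.mean()) / z.std(ddof=1)     # zero mean, unit sample variance
```

For a series already expressed in rates, one would call `transform_series(x, take_log=False)`.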
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Table 1: Partition of the Ts(T,i^y) w.r.t. CPI and the 15 "relevant" variables and the diagonal entries 
denote the indices (in the original data set) of the corresponding variables. 



| Variable | Meaning | Coefficients (without sign constraints) | Coefficients (with sign constraints) |
|---|---|---|---|
| PWFCSA_108 | Producer Price Index: Finished Goods | 0.150 | 0.207 |
| PWIMSA_109 | PPI: Finished Consumer Goods | 0.741 | 0.763 |
| PWCMSA_110 | PPI: Intermed. Mat. Supplies & Components | 0.074 | 0.196 |
| PU84_116 | CPI-U: Transportation | 0.307 | 0.304 |
| PUC_118 | CPI-U: Commodities | 0.290 | 0.235 |
| PUXF_121 | CPI-U: All Items Less Food | 0.397 | 0.378 |
| PUXHS_122 | CPI-U: All Items Less Shelter | -0.034 | 0 |
| PUXM_123 | CPI-U: All Items Less Medical Care | 0.246 | 0.217 |
| GMDC_124 | PCE, IMPL PR DEFL: PCE | -0.146 | 0 |
| GMDCN_126 | PCE, IMPL PR DEFL: PCE; Nondurables | -0.061 | 0 |
| PUCD_119 | CPI-U: Durables | -0.639 | 0.635 |
| GMDCD_125 | PCE, IMPL PR DEFL: PCE; Durables | -0.770 | 0.772 |
| PUS_120 | CPI-U: Services | -0.994 | 0.979 |
| GMDCS_127 | PCE, IMPL PR DEFL: PCE; Services | 0.111 | 0.203 |
| PU83_115 | CPI-U: Apparel & Upkeep | 1 | 1 |

Table 2: Detailed meanings of the variables with the corresponding parametric coefficients' estimates
using the groupwise dimension reduction method without (3rd column) and with (4th column) sign
constraints.

Figure 4 contains plots of the sample and thresholding regularized Spearman's rank correlation
matrices based on the "optimal" threshold 0.13 selected by the cross validation procedure discussed
in subsection 3.2 with $T_1 = 120$, $T_2 = 240$. The variables of special interest include the consumer price
index (CPI) as a measure of prices and an economic indicator. The annual percentage change in CPI
is used as a measure of inflation. CPI can be used to index (i.e., adjust for the effect of inflation)
the real value of wages, salaries and pensions, and also for regulating prices and deflating monetary
magnitudes to show changes in real values. Besides being a deflator of other economic series, it is also
a means of adjusting dollar values. Thus CPI is one of the most closely watched national economic
statistics. To this end, we use modeling of CPI to illustrate our method. Table 1 displays the partition
of $T_{\hat s}(\hat\Sigma)$ w.r.t. the variables relevant to CPI. The detailed meanings of these variables (and the
corresponding three-digit indices in the data set provided by Stock and Watson (2005)) are given
in Table 2. By the forward (and backward) procedure discussed in Section 4, we find the following
index sets for constructing semiparametric models for modeling CPI. The one without overlapping is
presented on the LHS, and that allowing overlapping is presented on the RHS.

$A_1$ = 108-110, 116, 118, 121-124, 126        $A'_1$ = 108-110, 116, 118, 121-124, 126

$A_2$ = 119, 125                               $A'_2$ = 121-124, 115

$A_3$ = 120, 127                               $A'_3$ = 121-123, 119

$A_4$ = 115                                    $A'_4$ = 121-123, 127

                                               $A'_5$ = 121, 123, 125

                                               $A'_6$ = 120, 121, 123

Notice that $A'_1 = A_1$, and the main difference between these two methods comes from $A_2$ to $A_4$
and $A'_2$ to $A'_6$, i.e. how to allocate the 119, 125, 120, 127, 115th variables, which originally results
from the rank correlations between them and the 121-124th variables. As we can see from Table
2, the 121-124th variables are very close to the $y$ variable: CPI-U: All Items (82-84=100, SA)
except one item (food, shelter or medical care) or the implicit price deflator (of personal consumption
expenditures).

Due to the identification and estimation problems we discussed before, from now on, we mainly
concentrate on the disjoint index sets case and suggest the following semiparametric model for
modeling CPI:

$$\begin{aligned}
\mathrm{E}(CPI_{114}) ={}& g_1\big(\beta_{108}PWFCSA_{108} + \beta_{109}PWIMSA_{109} + \beta_{110}PWCMSA_{110} + \beta_{116}PU84_{116} + \beta_{118}PUC_{118} \\
& \quad + \beta_{121}PUXF_{121} + \beta_{122}PUXHS_{122} + \beta_{123}PUXM_{123} + \beta_{124}GMDC_{124} + \beta_{126}GMDCN_{126}\big) \\
& + g_2\big(\beta_{119}PUCD_{119} + \beta_{125}GMDCD_{125}\big) + g_3\big(\beta_{120}PUS_{120} + \beta_{127}GMDCS_{127}\big) + g_4\big(PU83_{115}\big),
\end{aligned} \tag{10}$$

where $g_1, \ldots, g_4$ are the unknown link functions to be estimated nonparametrically and $\beta_{108}, \ldots, \beta_{127}$
are unknown parameters which belong to the parameter space. The variables $PUCD_{119}$ and $GMDCD_{125}$
denote the consumer price index and implicit price deflator (of personal consumption expenditures)
for durable goods respectively. Thus $A_2$ could be interpreted as the index set for durable goods.
Similarly, $A_3$ and $A_4$ could be interpreted as the index sets for services, and for apparel and upkeep
respectively, which are also important factors affecting the consumer price index. All common factors
strongly associated with CPI are included in $A_1$. Compared with the linear, additive or single index
models, the model (10) actually combines flexibility in statistical modeling and interpretability from
an economic point of view, while being kept close to the data's complex spatial structure.



We further employ the groupwise dimension reduction method in Li et al. (2010) to estimate
(10) and present the parametric coefficients' estimates in Table 2 (3rd column). Contradicting
the background knowledge of economics and the positive Spearman's rank correlations between CPI
and $PUXHS_{122}$, $GMDC_{124}$, $GMDCN_{126}$, $PUCD_{119}$, $GMDCD_{125}$, $PUS_{120}$ shown in Table 1, their
corresponding parametric coefficients are estimated to be negative. This would mean that the consumer
price index negatively depends on them, which is unlikely to be true. Finally we apply the modified
procedure with sign constraints to estimate (10) again and present the corresponding parametric
coefficients' estimates in the last column of Table 2, with the explained variation 85.8%. While
$\beta_{119}$, $\beta_{125}$ and $\beta_{120}$ are estimated positively, $\beta_{122}$, $\beta_{124}$ and $\beta_{126}$ are estimated to be 0, which means
$PUXHS_{122}$, $GMDC_{124}$ and $GMDCN_{126}$ could be eliminated from the model. To compare with the
linear models and see the advantages of semiparametrics, we also consider the linear model using all
other 130 variables (except CPI itself). The explained variation w.r.t. the LARS estimate (least angle
regression, developed by Efron et al. (2004)) is 80.4%.



Besides the measure of prices, other variables of special interest include a measure of real economic
activity and a monetary policy instrument. As in Christiano et al. (1999), we use employment as
an indicator of real economic activity measured by the number of employees on non-farm payrolls
(EMPL). The monetary policy instrument is the Federal Funds Rate (FFR). If we apply the SCE
approach to estimate EMPL and FFR, the explained variation is 99.9% and 97.6% respectively, while
that of the corresponding LARS estimates is 99.6% and 86.7%. These results are summarized in Table 3.
Thus we see that we could reduce the SSE by approximately 27.4% for CPI and 82.0% for FFR
by considering the (flexible and proper) semiparametrics. The improvement for EMPL is not
significant since the LARS estimate has already performed quite well.
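The SSE reductions quoted above can be related to the explained variations in Table 3 by a short calculation. The sketch assumes that "explained variation" is the usual $R^2 = 1 - SSE/SST$ and that both methods share the same total sum of squares $SST$; it uses only the rounded percentages reported in Table 3.

```latex
\[
\frac{SSE_{LARS} - SSE_{SCE}}{SSE_{LARS}}
  = \frac{R^2_{SCE} - R^2_{LARS}}{1 - R^2_{LARS}},
\qquad
\text{CPI: } \frac{0.858 - 0.804}{1 - 0.804} = \frac{0.054}{0.196} \approx 0.276,
\qquad
\text{FFR: } \frac{0.976 - 0.867}{1 - 0.867} = \frac{0.109}{0.133} \approx 0.820 .
\]
```

With these rounded inputs the calculation gives about 27.6% for CPI and 82.0% for FFR; the small gap to the 27.4% quoted in the text would be consistent with rounding of the reported percentages.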





| | CPI | EMPL | FFR |
|---|---|---|---|
| $R^2$ (SCE) | 85.8% | 99.9% | 97.6% |
| $R^2$ (LARS) | 80.4% | 99.6% | 86.7% |

Table 3: Explained variation of the SCE and LARS estimates for CPI, EMPL and FFR.



6 Concluding Remarks and Discussions 

In this paper, we consider estimating a large spatial covariance matrix of the generalized $m$-dependent
and $\beta$-mixing time series (with $J$ variables and $T$ observations) by hard thresholding regularization.
We quantify the interplay between the estimators' consistency rate and the time dependence level,
discuss an intuitive resampling scheme for threshold selection, and prove a general cross-validation
result that justifies this approach. Given a consistently estimated large sparse covariance matrix,
by utilizing the natural links among graphical models, semiparametrics and the large spatial covariance
matrix, we propose a novel forward (and backward) label permutation procedure to form a block
diagonal structure for it and construct the corresponding low dimensional semiparametric model.
Finally we apply this method to study the spatial structure of large panels of economic and financial
time series to find the proper semiparametric structure for estimating the consumer price index (CPI)
and present its superiority over the linear models.
Choice of Threshold 

Concerning the choice of threshold in the context of time series analysis, if we are mainly targeting
estimation performance of the corresponding semiparametric models instead of minimizing the loss
functions (4) and (5) related to the covariance matrix estimation, we might directly consider
minimizing the estimation error based on the selected semiparametric model, for example (2), s.t. the
prediction performance might be optimized.

Other Measures of Dependence for Screening 

The information given by a Pearson's correlation coefficient is not enough to define the dependence
structure between random variables. Besides the Spearman's rank correlation we used here, distance
correlation (Székely et al. (2007)) and Brownian covariance (correlation) (Székely and Rizzo (2009)) were
also introduced to address the deficiency of Pearson's correlation that it can be zero for dependent
random variables; zero distance correlation and zero Brownian correlation imply independence. The
correlation ratio is able to detect almost any functional dependency, and the entropy-based mutual
information/total correlation is capable of detecting even more general dependencies. We want to
point out that Step 1 of the SCE procedure could be very easily extended to the measures
above, and the threshold value could be selected by the cross-validation procedure similarly.



It is also noteworthy that Fan et al. (2011) consider the independence screening procedure by
ranking the explanatory variables' importance according to the descending order of the residual sum
of squares of the componentwise nonparametric regressions or the marginal strength of the marginal
nonparametric regression. By doing that, they (implicitly) assume that the true semiparametric
structure is additive, which is different from our ultimate goal here: to construct the proper semiparametric
structure.

Theoretical Study of the Screening Step 

Noticing that $F_i(x_i)$ and $F_j(x_j)$ in formula (8) follow the uniform distribution on $[0, 1]$, (8)
could be simplified as $\rho_{x_i,x_j} = 12\,\mathrm{E}\{F_i(x_i)F_j(x_j)\} - 3$. Similar to the "sure independence screening"
property of Fan and Lv (2008) using Pearson's correlation for variable screening of linear models, to
study the theoretical property of the screening step here based on the Spearman's rank correlation,
parallel to the equation (20) "$\omega = X^\top y = X^\top X\beta + X^\top\varepsilon$" of Fan and Lv (2008), we could define
$\omega = (\omega_1, \ldots, \omega_{J-1})^\top$ with

$$\begin{aligned}
\omega_j &= 12 F_j(x_j)F_J(x_J) - 3 = 12 F_j(x_j)F_y(y) - 3, \quad 1 \leq j \leq J - 1 \\
&= 12 F_j(x_j)F_y\Big\{\sum_{s=1}^{S} g_s(\beta_{A_s}^\top x_{A_s}) + \varepsilon_0\Big\} - 3,
\end{aligned} \tag{11}$$

where $\varepsilon_0$ is the (conditional) mean-zero error term from approximating $y$ by $\sum_{s=1}^{S} g_s(\beta_{A_s}^\top x_{A_s})$.



Similar to the idea of the (group) MAVE method of Xia et al. (2002), Li et al. (2010), we notice that 



provided by $g_s'(\beta_{A_s}^\top x_{A_s})$ is well defined. Thus applying the Taylor expansion to $\sum_{s=1}^{S} g_s(\beta_{A_s}^\top x_{A_s}) + \varepsilon_0$
at $x'$ will help linearize it as:

$$a + \sum_{s=1}^{S} g_s'(\beta_{A_s}^\top x'_{A_s})\beta_{A_s}^\top(x - x')_{A_s} + O\Big\{\sum_{s=1}^{S}(x - x')_{A_s}^\top(x - x')_{A_s}\Big\} + \varepsilon_0
\stackrel{\mathrm{def}}{=} a + \sum_{s=1}^{S} b_s\beta_{A_s}^\top(x - x')_{A_s} + \varepsilon.$$

Therefore, we could rewrite (11) as

$$12 F_j(x_j)F_y\Big\{a + \sum_{s=1}^{S} b_s\beta_{A_s}^\top(x - x')_{A_s} + \varepsilon\Big\} - 3. \tag{12}$$

Studying the property of (12) will be the main focus. However, due to the presence of the cumulative
distribution functions $F_j$ and $F_y$ here, this is expected to be much more complex than the Pearson's
correlation case. We hope that others can investigate this further.
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7 Appendix 



Proof of Theorem 3.1 The proof of this theorem is based on the ones of Theorem 1 and 2 in Bickel
and Levina (2008a) up to a modification of the bound on $P\{\max_{i,j}|\hat\sigma_{ij} - \sigma_{ij}| \geq s\}$, as remarked by their
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subsection 2.3. By the definition of cjij in ([3]) and tlie assumption tliat for all i and j, |XtjXfj| ^ 
holds with a high probability, applying the (extended) Mcdiarmid inequality, see Theorem 2.1 of 



Janson 



(2004), to the sum of dependent random vectors ^1,1=1 \^tiXtj\ yields: 

P{max|<T,j | ^ s} ^ exp | - ^ | = exp{(2 - M'^) log J}, 

where st = M' A/log7^Y*(T)7T with sufficiently large M' also depending on C with XlLi M^C'/T ^ 



C and log JA:'*(T)/T = 0(1). Since equation (10) in Bickel and Levina (2008a) holds, others go 
through verbatim. This completes the proof. □ 



Proof of Theorem 3.2 The proof of this theorem is also based on the ones of Theorem 1 and
2 in Bickel and Levina (2008a) up to a modification of the bound on $P\{\max_{i,j}|\hat\sigma_{ij} - \sigma_{ij}| \geq s\}$.
Assume the /3-mixing sequence {XtiXtj}J^i to satisfy Assumption 3.1 applying the Bernstein 



type inequality for /3-mixing random variables {XuXtj}f^i, see Theorem 4 of Doukhan (1994) [P. 36], 



yields that, Ve > 



dcf 



£2/4) and V < g ^ 1, 



P{\Y,XuXtj \ ^ sT) ^ 4exp 



t=i 



{1 ~ e)3{l + e)s'^T 
2{3{1 + e)a^ + qMsT}i 



+ 2 



V 



(13) 



■B 



To make J'^{A + B) arbitrarily small, we choose st = M' y with sufficiently large M' also 
depending on e,a^,M, log J/T = o(l), g = 3(1 + 6)a'^/{MsT), and Pmix = O {{J^+^' ^/\og JT)-^} 
with 5' > 0. Thus A and B are bounded by exp(— M'^ log J) and ,J~^'^+^"> respectively, which can be 
arbitrarily close to 0. This completes the proof. □ 



Proof of Lemma 3.2 Since

B 

P (^J-^\tr{VtB -VJ:)\ ^ ^ P (^J-^\tr{B-^J2^^p^J -^^)\ ^ 



p=i 



and tr{XpX^ ) = triX^X,), triZp=i VX^X^ - VS) ^ J Ep=i with X^ = J-^ ^fi' applying 



B 



-1 v^J 



the same inequality as in (13) to J2p=iXp leads to 



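$$P\big(J^{-1}|\mathrm{tr}(V\hat\Sigma_B - V\Sigma)| \geq s\big) \leq K_1\exp(-K_2 s^2 B)$$

with some constants $K_1$ and $K_2$.
Consequently we also have

$$P\Big(J^{-1}\max_{p=1,\ldots,P}|\mathrm{tr}(V_p\hat\Sigma_B - V_p\Sigma)| \geq s\Big) \leq \mathbf{1}(0 \leq s \leq x) + K_1 P\exp(-K_2 s^2 B)\,\mathbf{1}(s > x). \tag{14}$$

If we integrate (14), i.e.

$$J^{-1}\,\mathrm{E}\max_{p=1,\ldots,P}\big(|\mathrm{tr}\{V_p\hat\Sigma_B - \mathrm{E}(V_p\Sigma)\}|\big) \leq x + K_1 P\int_x^{\infty}\exp(-K_2 s^2 B)\,ds, \tag{15}$$

and minimize the RHS of (15) over $x$ as $P \to \infty$, we find that the minimizer satisfies
$x = C(q, c_0, M)\sqrt{\log P/B}\,\{1 + o(1)\}$. Hence

$$J^{-1}\,\mathrm{E}\max_{p=1,\ldots,P}\big(|\mathrm{tr}\{V_p\hat\Sigma_B - \mathrm{E}(V_p\Sigma)\}|\big) \leq C(q, c_0, M)\sqrt{\log P/B}. \qquad \square$$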



Proof of Theorem 3.3 Based on Lemma 3.2, we conclude that $\rho(P)$ from the second condition of
Lemma 3.1 satisfies

$$\rho(P) \leq C(q, c_0, M)\,J\sqrt{\log P/B}.$$

Hence, Lemma 3.2 implies that



E\\B-'J2XpXj -m,<:c,p{P). 
p=l 

Hence, if we select Bt = Te{T, J) and logP = o{T''/^co{J)J'\logjy-'i/^e{T, J)}, the conditions of 
Lemma 3.1 are satisfied and Theorem 3.3 follows. □
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